Ensemble Techniques
Parkinson’s Disease Data Analysis Using Ensemble Techniques
- Objective:
- Domain: Medicine
- Data Description & Context:
- Attribute Information:
- Learning Outcomes:
- Exploratory Data Analysis
- Observation:
- Note:
- Univariate & bivariate analysis:
- Plotting vocal fundamental frequency data columns
- Observation:
- Plotting several measures of variation in fundamental frequency
- Observation:
- Plotting several measures of variation in amplitude
- Plotting:
- Observation:
- Let's do the pair plot to see the relationship between each pair of variables and the influence of the status variable!
- Observations
- Data Processing
- Model creation
- Let's split the dataset into training and test set in the ratio of 70:30
- Let's train Logistic Regression, Naive Bayes, SVM, and k-NN algorithms and check the accuracies on the test data
- Let's plot the accuracy results of all the models
- Result:
- Observation:
- Now let's train a meta-classifier and check the accuracy
- Observation:
- Ensemble Techniques:
- Let's quickly construct and visualize a decision tree!
- Applying the Random forest model to check the accuracy of the Model
- Applying the Bagging Classifier Algorithm
- Boosting Classifier
- Applying the Adaboost and GradientBoost Ensemble Algorithms
- Observation:
- Let's include a couple of other ensemble classifiers and more base classifiers
- Observation:
- Observation:
- Conclusion:
Submitted by: Dr. Karthick Lakshmanan
Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.
Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.
Attribute Information:
- name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
- NHR,HNR - Two measures of ratio of noise to tonal components in the voice
- status - Health status of the subject (one) - Parkinson's, (zero) - healthy
- RPDE,D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation
import numpy as np #to import numpy library as np
import pandas as pd #to import pandas library as pd
import matplotlib.pyplot as plt #to import matplotlib's pyplot module as plt, this is for plotting features
%matplotlib inline
import scipy.stats as stats #performing statistics using scipy library. it is imported as stats
import seaborn as sns #data visualization library based on matplotlib, imported as sns
sns.set(color_codes=True) # setting colorcodes as per matplotlib
import warnings
warnings.filterwarnings("ignore") # to ignore the warnings
np.set_printoptions(suppress=True)
# list of modules from sklearn library
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve, confusion_matrix
#sklearn library for different classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import naive_bayes
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn import svm,model_selection, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
file = open('Parkinsons_names', 'r') # access and read the description file!
lines = file.readlines()
for line in lines:
    print(line.strip())
data = pd.read_table("Parkinsons_data", sep=",") # the data file is comma-separated!
data.head()
data.shape
data.info()
data.isnull().sum()
Note: since we get 0 for every column above, no element of the dataframe is empty or NaN; every cell holds a value
data.nunique()
data.describe().T
Observation:
- For MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP, MDVP:Shimmer, MDVP:Shimmer(dB), MDVP:APQ, Shimmer:DDA, NHR, and PPE, the mean is slightly higher than the median, so the data looks right-tailed; the Q3-to-max difference is significantly larger than the min-to-Q1 difference.
- For Shimmer:APQ3 and Shimmer:APQ5, the mean is slightly higher than the median and the data is slightly right-tailed.
- For HNR, the mean is less than the median and the data is slightly left-tailed.
- For RPDE, DFA, spread1, spread2, and D2, the data looks normal, with mean and median nearly equal.
Note:
- the target variable 'status' is of type int64; it should be converted to a categorical variable for analysis
- there are 23 attributes. The "name" attribute is a recording identifier, is not required for analysis, and shall be dropped
- the data columns have different units and scales; scaling is needed to bring them to a common scale for modeling
Data-cleaning needs:
- data scaling is required
- outlier treatment is required
- a few columns have multi-Gaussian distributions
- the target variable 'status' is of type int64; this needs to be converted to a categorical variable for analysis
data["status"].value_counts()
Note: health status of the subject:
- 1 = Parkinson's
- 0 = healthy
sns.countplot(data['status'])
plt.xlabel('PD status')
datapd = data['status'].value_counts().to_frame()
datapd['percent'] = data['status'].value_counts(normalize=True)*100
print(datapd)
data1 = data.drop(['name'], axis=1)
Note: Dropping the "name" column, since it is insignificant and not required for building the model
plt.figure(figsize= (10,30))
plt.subplot(6,2,1)
sns.distplot(data1[["MDVP:Fhi(Hz)"]], color='red')
plt.xlabel('MDVP:Fhi(Hz)')
plt.subplot(6,2,2)
sns.boxplot(data1[["MDVP:Fhi(Hz)"]], color='red')
plt.xlabel('MDVP:Fhi(Hz)')
plt.subplot(6,2,3)
sns.distplot(data1[["MDVP:Flo(Hz)"]], color='red')
plt.xlabel('MDVP:Flo(Hz)')
plt.subplot(6,2,4)
sns.boxplot(data1[["MDVP:Flo(Hz)"]], color='red')
plt.xlabel('MDVP:Flo(Hz)')
plt.subplot(6,2,5)
sns.distplot(data1[["MDVP:Fo(Hz)"]], color='red')
plt.xlabel('MDVP:Fo(Hz)')
plt.subplot(6,2,6)
sns.boxplot(data1[["MDVP:Fo(Hz)"]], color='red')
plt.xlabel('MDVP:Fo(Hz)')
plt.figure(figsize= (10,40))
plt.subplot(10,2,1)
sns.distplot(data1[["MDVP:Jitter(%)"]], color='blue')
plt.xlabel('MDVP:Jitter(%)')
plt.subplot(10,2,2)
sns.boxplot(data1[["MDVP:Jitter(%)"]], color='blue')
plt.xlabel('MDVP:Jitter(%)')
plt.subplot(10,2,3)
sns.distplot(data1[["MDVP:Jitter(Abs)"]], color='blue')
plt.xlabel('MDVP:Jitter(Abs)')
plt.subplot(10,2,4)
sns.boxplot(data1[["MDVP:Jitter(Abs)"]], color='blue')
plt.xlabel('MDVP:Jitter(Abs)')
plt.subplot(10,2,5)
sns.distplot(data1[["MDVP:RAP"]], color='blue')
plt.xlabel('MDVP:RAP')
plt.subplot(10,2,6)
sns.boxplot(data1[["MDVP:RAP"]], color='blue')
plt.xlabel('MDVP:RAP')
plt.subplot(10,2,7)
sns.distplot(data1[["MDVP:PPQ"]], color='blue')
plt.xlabel('MDVP:PPQ')
plt.subplot(10,2,8)
sns.boxplot(data1[["MDVP:PPQ"]], color='blue')
plt.xlabel('MDVP:PPQ')
plt.subplot(10,2,9)
sns.distplot(data1[["Jitter:DDP"]], color='blue')
plt.xlabel('Jitter:DDP')
plt.subplot(10,2,10)
sns.boxplot(data1[["Jitter:DDP"]], color='blue')
plt.xlabel('Jitter:DDP')
plt.figure(figsize= (10,25))
plt.subplot(10,2,1)
sns.distplot(data1[["MDVP:Shimmer"]], color='green')
plt.xlabel('MDVP:Shimmer')
plt.subplot(10,2,2)
sns.boxplot(data1[["MDVP:Shimmer"]], color='green')
plt.xlabel('MDVP:Shimmer')
plt.subplot(10,2,3)
sns.distplot(data1[["MDVP:Shimmer(dB)"]], color='green')
plt.xlabel('MDVP:Shimmer(dB)')
plt.subplot(10,2,4)
sns.boxplot(data1[["MDVP:Shimmer(dB)"]], color='green')
plt.xlabel('MDVP:Shimmer(dB)')
plt.subplot(10,2,5)
sns.distplot(data1[["Shimmer:APQ3"]], color='green')
plt.xlabel('Shimmer:APQ3')
plt.subplot(10,2,6)
sns.boxplot(data1[["Shimmer:APQ3"]], color='green')
plt.xlabel('Shimmer:APQ3')
plt.subplot(10,2,7)
sns.distplot(data1[["Shimmer:APQ5"]], color='green')
plt.xlabel('Shimmer:APQ5')
plt.subplot(10,2,8)
sns.boxplot(data1[["Shimmer:APQ5"]], color='green')
plt.xlabel('Shimmer:APQ5')
plt.subplot(10,2,9)
sns.distplot(data1[["MDVP:APQ"]], color='green')
plt.xlabel('MDVP:APQ')
plt.subplot(10,2,10)
sns.boxplot(data1[["MDVP:APQ"]], color='green')
plt.xlabel('MDVP:APQ')
plt.subplot(10,2,11)
sns.distplot(data1[["Shimmer:DDA"]], color='green')
plt.xlabel('Shimmer:DDA')
plt.subplot(10,2,12)
sns.boxplot(data1[["Shimmer:DDA"]], color='green')
plt.xlabel('Shimmer:DDA')
plt.figure(figsize= (10,35))
plt.subplot(10,2,1)
sns.distplot(data1[["NHR"]], color='cyan')
plt.xlabel('NHR')
plt.subplot(10,2,2)
sns.boxplot(data1[["NHR"]], color='cyan')
plt.xlabel('NHR')
plt.subplot(10,2,3)
sns.distplot(data1[["HNR"]], color='cyan')
plt.xlabel('HNR')
plt.subplot(10,2,4)
sns.boxplot(data1[["HNR"]], color='cyan')
plt.xlabel('HNR')
plt.subplot(10,2,5)
sns.distplot(data1[["RPDE"]], color='yellow')
plt.xlabel('RPDE')
plt.subplot(10,2,6)
sns.boxplot(data1[["RPDE"]], color='yellow')
plt.xlabel('RPDE')
plt.subplot(10,2,7)
sns.distplot(data1[["D2"]], color='yellow')
plt.xlabel('D2')
plt.subplot(10,2,8)
sns.boxplot(data1[["D2"]], color='yellow')
plt.xlabel('D2')
plt.subplot(10,2,9)
sns.distplot(data1[["DFA"]], color='pink')
plt.xlabel('DFA')
plt.subplot(10,2,10)
sns.boxplot(data1[["DFA"]], color='pink')
plt.xlabel('DFA')
Plotting three nonlinear measures of fundamental frequency variation
- spread1
- spread2
- PPE
plt.figure(figsize= (10,30))
plt.subplot(6,2,1)
sns.distplot(data1[["spread1"]], color='orange')
plt.xlabel('spread1')
plt.subplot(6,2,2)
sns.boxplot(data1[["spread1"]], color='orange')
plt.xlabel('spread1')
plt.subplot(6,2,3)
sns.distplot(data1[["spread2"]], color='orange')
plt.xlabel('spread2')
plt.subplot(6,2,4)
sns.boxplot(data1[["spread2"]], color='orange')
plt.xlabel('spread2')
plt.subplot(6,2,5)
sns.distplot(data1[["PPE"]], color='orange')
plt.xlabel('PPE')
plt.subplot(6,2,6)
sns.boxplot(data1[["PPE"]], color='orange')
plt.xlabel('PPE')
sns.boxplot(data=data1,orient='h')
sns.pairplot(data1, hue="status", palette="husl")
Note: since the plot looks complex, let's zoom in to sections!
sns.pairplot(data1[['MDVP:Fo(Hz)','MDVP:Fhi(Hz)','MDVP:Flo(Hz)','status']], palette='RdBu', hue="status")
sns.pairplot(data1[['MDVP:Jitter(%)', 'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP', 'MDVP:Shimmer', 'MDVP:Shimmer(dB)',
'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ','Shimmer:DDA', 'status']], palette='RdBu', hue="status")
sns.pairplot(data1[[ 'NHR', 'HNR', 'status', 'RPDE','DFA', 'spread1', 'spread2',
'D2', 'PPE']], palette='RdBu', hue="status")
corr = data1.corr()
plt.figure(figsize=(15,8))
ax = sns.heatmap(corr[(corr >= 0.1) | (corr <= -0.1)], annot=True, cmap ='coolwarm',linecolor ='white', linewidths = 1)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Observations
- MDVP:Jitter(%) is strongly correlated with MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, and Jitter:DDP
- MDVP:Shimmer(dB) is positively correlated with MDVP:APQ, Shimmer:APQ3, Shimmer:APQ5, and Shimmer:DDA
- HNR and NHR are highly negatively correlated
- The class distributions are more separable along some dimensions, such as spread1, spread2, PPE, and D2
- Attributes with multi-Gaussian distributions: MDVP:Fo(Hz), MDVP:Flo(Hz), DFA
Quick inference
As expected, closely related measures show high mutual correlation: the measures of variation in fundamental frequency (MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP) and the measures of variation in amplitude (MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA).
With relevance to the target variable:
- spread1 has a positive correlation with the status variable, as do PPE and spread2, but none of these are very strong
- MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), and HNR are negatively correlated with the target variable
- Except for HNR, the other variables have a moderate-to-strong correlation with the status field and can therefore influence status-based analysis
data2 = data1.copy()
Let's treat the outliers in the columns by capping them at the IQR bounds!
# function to replace upper outliers with the upper bound (Q3 + 1.5 * IQR)
def upper_outliers(df, colname):
    data = df[colname]
    iqr = np.quantile(a=data, q=0.75) - np.quantile(a=data, q=0.25)
    ub = np.quantile(a=data, q=0.75) + 1.5 * iqr
    df[colname] = df[colname].apply(lambda x: ub if x > ub else x)
# function to replace lower outliers with the lower bound (Q1 - 1.5 * IQR)
def lower_outliers(df, colname):
    data = df[colname]
    iqr = np.quantile(a=data, q=0.75) - np.quantile(a=data, q=0.25)
    lb = np.quantile(a=data, q=0.25) - 1.5 * iqr
    df[colname] = df[colname].apply(lambda x: lb if x < lb else x)
upper_outliers(data2, "MDVP:Fhi(Hz)")
upper_outliers(data2, "MDVP:Flo(Hz)")
upper_outliers(data2, "MDVP:Jitter(%)")
upper_outliers(data2, "MDVP:Jitter(Abs)")
upper_outliers(data2, "MDVP:RAP")
upper_outliers(data2, "MDVP:PPQ")
upper_outliers(data2, "Jitter:DDP")
upper_outliers(data2, "MDVP:Shimmer")
upper_outliers(data2, "MDVP:Shimmer(dB)")
upper_outliers(data2, "Shimmer:APQ3")
upper_outliers(data2, "Shimmer:APQ5")
upper_outliers(data2, "MDVP:APQ")
upper_outliers(data2, "Shimmer:DDA")
upper_outliers(data2, "NHR")
upper_outliers(data2, "PPE")
lower_outliers(data2, "HNR")
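The sixteen calls above could also be collapsed into one generic capping helper applied in a loop; a minimal sketch (the `cap_outliers` name and the toy frame are illustrative, not part of the notebook):

```python
import pandas as pd

def cap_outliers(df, colname, side="upper"):
    """Cap values beyond Q3 + 1.5*IQR (or below Q1 - 1.5*IQR) at the bound."""
    q1, q3 = df[colname].quantile([0.25, 0.75])
    iqr = q3 - q1
    if side == "upper":
        df[colname] = df[colname].clip(upper=q3 + 1.5 * iqr)
    else:
        df[colname] = df[colname].clip(lower=q1 - 1.5 * iqr)

# demo on a toy frame with one obvious upper outlier
toy = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0, 100.0]})
cap_outliers(toy, "v", side="upper")
print(toy["v"].max())  # the 100.0 is capped at Q3 + 1.5*IQR = 7.0
```

With this helper, the upper-outlier columns above would be handled by a single `for col in [...]: cap_outliers(data2, col)` loop, with one extra call for HNR on the lower side.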
plt.figure(figsize=(15,8))
plt.subplot(2,2,1)
ax = sns.boxplot(data=data1)
plt.xticks(rotation=45, ha='right')
plt.title('Before : Data without outlier treatment')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.subplot(2,2,2)
ax = sns.boxplot(data=data2)
plt.xticks(rotation=45, ha='right')
plt.title('After : Data with outlier treatment')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Now let's scale the data
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
data3 = data2.copy()
index_name = ['MDVP:Fo(Hz)','MDVP:Fhi(Hz)','MDVP:Flo(Hz)','MDVP:Jitter(Abs)','MDVP:PPQ','MDVP:Shimmer','Shimmer:APQ3','Shimmer:APQ5','MDVP:APQ','NHR','HNR','RPDE','DFA', 'spread1','spread2','D2','PPE']
data3[index_name] = standardScaler.fit_transform(data3[index_name])
plt.figure(figsize=(15,8))
plt.subplot(2,2,1)
ax = sns.boxplot(data=data2)
plt.xticks(rotation=45, ha='right')
plt.title('Before : Data without scaling')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.subplot(2,2,2)
ax = sns.boxplot(data=data3)
plt.xticks(rotation=45, ha='right')
plt.title('After : Data with scaling')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
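One caveat about the cell above: the scaler is fitted on the full dataset before the train/test split, which leaks test-set statistics into training. A minimal sketch of the leakage-free pattern, fitting only on the training fold (the `X_demo`/`y_demo` arrays are synthetic stand-ins for the Parkinson's columns):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy feature matrix and labels standing in for the Parkinson's data
rng = np.random.RandomState(7)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.randint(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit statistics on the training fold only
X_te_scaled = scaler.transform(X_te)      # reuse the training mean/std on the test fold

print(X_tr_scaled.mean(axis=0).round(6))  # ~0 for every training column
```

The test fold's scaled mean will generally not be exactly zero, which is expected: it is standardized with the training fold's statistics.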
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef
x=data3.drop('status', axis=1)
y=data3.loc[:,'status']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state = 7)
display(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
best_accuracy = []
score_list = []
best_score = 0
best_k = 0
for each in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=each)
    knn.fit(x_train, y_train)
    score = knn.score(x_test, y_test)
    score_list.append(score)
    if score > best_score:
        best_score = score
        best_k = each
plt.plot(range(1,15), score_list)
plt.xlabel("k values")
plt.ylabel("Accuracy")
plt.show()
print("The best accuracy we got is ", best_score)
print("Best accuracy's k value is ", best_k)
best_accuracy.append(best_score)
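Note that the loop above tunes k directly on the test set, which is optimistic. A sketch of the cross-validated alternative using `GridSearchCV` on the training data only (the `X_demo`/`y_demo` synthetic data stands in for `x_train`/`y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# stand-in data; in the notebook this would be x_train / y_train
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=7)

# 5-fold cross-validation over the same k range as the loop above
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 15))},
                    cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)
print(grid.best_params_["n_neighbors"], round(grid.best_score_, 3))
```

The test set is then touched only once, to score `grid.best_estimator_`.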
lr = LogisticRegression()
lr.fit(x_train, y_train)
print("test accuracy {}".format(lr.score(x_test, y_test)))
best_accuracy.append(lr.score(x_test, y_test))
svc_scores = []
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for i in range(len(kernels)):
    svc_classifier = SVC(kernel=kernels[i])
    svc_classifier.fit(x_train, y_train)
    svc_scores.append(svc_classifier.score(x_test, y_test))
plt.bar(kernels, svc_scores)
for i in range(len(kernels)):
    plt.text(i, svc_scores[i], svc_scores[i])
plt.xlabel('Kernels')
plt.ylabel('Scores')
plt.title('Support Vector Classifier scores for different kernels')
Note: the "rbf" kernel shows the maximum score, so let's use this kernel below!
svm = SVC(kernel="rbf", random_state=42)
svm.fit(x_train, y_train)
print("Accuracy of SVM: ", svm.score(x_test, y_test))
best_accuracy.append(svm.score(x_test, y_test))
nb = GaussianNB()
nb.fit(x_train,y_train)
print("Accuracy of NB: ", nb.score(x_test,y_test))
best_accuracy.append(nb.score(x_test,y_test))
dt_scores = []
for i in range(1, len(x.columns) + 1):
    dt_classifier = DecisionTreeClassifier(max_features=i, random_state=0)
    dt_classifier.fit(x_train, y_train)
    dt_scores.append(dt_classifier.score(x_test, y_test))
plt.plot([i for i in range(1, len(x.columns) + 1)], dt_scores, color='red')
for i in range(1, len(x.columns) + 1):
    plt.text(i, dt_scores[i-1], (i, dt_scores[i-1]))
plt.xticks([i for i in range(1, len(x.columns) + 1)])
plt.xlabel('Max features')
plt.ylabel('Scores')
plt.title('Decision Tree Classifier scores for different numbers of maximum features')
Note: as seen in the above plot, a max_features value of 20 is applied for model building
dt = DecisionTreeClassifier(max_features =20)
dt.fit(x_train,y_train)
print("Accuracy of Decision Tree: ", dt.score(x_test,y_test))
best_accuracy.append(dt.score(x_test,y_test))
sv_ml = [ "KNN", "Logistic Regression","SVM","Naive Bayes", "Decision Tree"]
plt.figure(figsize=(15,5))
sns.barplot(x = sv_ml, y = best_accuracy)
plt.xticks(rotation= 30)
plt.xlabel('Supervised Learning Types')
plt.ylabel('Accuracy')
plt.title('Supervised Learning Types v Accuracy')
plt.show()
results1 = []
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef
num_folds = 10
num_instances = len(x_train)
seed = 8
scoring = 'accuracy'
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier(n_neighbors=3)))
models.append(('DT', DecisionTreeClassifier(max_features =20)))
models.append(('NB', GaussianNB()))
models.append(('SVC', SVC(kernel="rbf")))
results = []
names = []
print("Scores for each algorithm:")
for name, model in models:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    print(name, accuracy_score(y_test, predictions)*100)
    results1.append(accuracy_score(y_test, predictions))
    print(matthews_corrcoef(y_test, predictions))
    print()
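The loop also prints the Matthews correlation coefficient alongside accuracy; it ranges from -1 to +1 and, unlike accuracy, is robust to class imbalance such as the 75/25 status split here. A toy illustration with hand-picked labels (not from the dataset):

```python
from sklearn.metrics import matthews_corrcoef

# 10 toy labels: 8 of 10 predictions correct, errors in both classes
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_hat  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
#     = (4*4 - 1*1) / sqrt(5*5*5*5) = 15/25 = 0.6
print(round(matthews_corrcoef(y_true, y_hat), 3))  # -> 0.6
```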
sv_ml2 = ["Logistic Regression", "KNN", "Decision Tree", "Naive Bayes","SVM"]
plt.figure(figsize=(15,5))
sns.barplot(x = sv_ml2, y = results1)
plt.xticks(rotation= 30)
plt.xlabel('Supervised Learning Types')
plt.ylabel('Accuracy')
plt.title('Supervised Learning Types v Accuracy')
plt.show()
colors = ['red','green','blue','cyan','purple']
labels = sv_ml2
explode = [0,0,0,0,0]
sizes = results1
# visual
plt.figure(figsize = (7,7))
plt.pie(sizes, labels=labels, explode=explode, colors=colors, autopct='%1.1f%%')
plt.title('Comparison of Accuracies',color = 'brown',fontsize = 15)
plt.show()
Extra: Voting based Ensemble learning
models #array of models created before
from sklearn.ensemble import VotingClassifier
voting_ensemble = VotingClassifier(models)  # avoid shadowing the sklearn 'ensemble' module imported earlier
results = model_selection.cross_val_score(voting_ensemble, x, y, cv=kfold)
print('Accuracy based on combined models : %.2f' % (results.mean()))
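The voting above is hard (majority) voting. A sketch of the soft-voting variant, which averages predicted class probabilities instead of counting votes (synthetic `X_demo`/`y_demo` stands in for the notebook's `x`/`y`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=7)

# soft voting averages predict_proba outputs, so SVC needs probability=True
soft = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("svc", SVC(probability=True))],
    voting="soft")
acc = cross_val_score(soft, X_demo, y_demo, cv=5).mean()
print(round(acc, 3))
```

Soft voting often edges out hard voting when the base models produce well-calibrated probabilities.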
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
from sklearn.model_selection import cross_val_score, train_test_split
import itertools
from sklearn.decomposition import PCA
clf1 = GaussianNB()
clf2 = SVC()
clf3 = DecisionTreeClassifier()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
label = ['NB', 'SVC', 'DT', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]
fig = plt.figure(figsize=(10,8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
clf_cv_mean = []
clf_cv_std = []
for clf, label, grd in zip(clf_list, label, grid):
    scores = cross_val_score(clf, x_test, y_test, cv=3, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())
    pca = PCA(n_components=2)
    X_train2 = pca.fit_transform(x_test)
    clf.fit(X_train2, y_test)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(np.array(X_train2), np.array(y_test), clf=clf, legend=2)
    plt.title(label)
plt.show()
plt.figure()
(_, caps, _) = plt.errorbar(range(4), clf_cv_mean, yerr=clf_cv_std, c='blue', fmt='-o', capsize=5)
for cap in caps:
    cap.set_markeredgewidth(1)
plt.xticks(range(4), ['NB', 'SVC', 'DT', 'Stacking'])
plt.ylabel('Accuracy'); plt.xlabel('Classifier'); plt.title('Stacking Ensemble');
plt.show()
plt.figure()
plot_learning_curves(x_train, y_train, x_test, y_test, sclf, print_model=False, style='ggplot')
plt.show()
Observation:
- Among the chosen models, the stacking classifier performed best, with 0.88 accuracy
- the choice of base models influences the stacking ensemble
Inference:
The stacking ensemble consists of Naive Bayes, SVC, and a Decision Tree as base classifiers, whose predictions are combined by Logistic Regression as a meta-classifier. We can see the blending of decision boundaries achieved by the stacking classifier. The figure also shows that stacking achieves higher accuracy than the individual classifiers.
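The same stack can be built without mlxtend using scikit-learn's built-in `StackingClassifier` (available from sklearn 0.22); a sketch with the same base/meta layout on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=7)

# NB, SVC, and DT base learners; Logistic Regression as meta-classifier,
# trained on out-of-fold base predictions (internal cv=5 by default)
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("svc", SVC()),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
acc = cross_val_score(stack, X_demo, y_demo, cv=5).mean()
print(round(acc, 3))
```

Unlike the mlxtend default, sklearn's version fits the meta-classifier on out-of-fold predictions, which reduces overfitting of the meta level.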
dt_model = DecisionTreeClassifier(max_features =20,criterion='entropy',max_depth=6,random_state=100,min_samples_leaf=5)
dt_model.fit(x_train, y_train)
dt_model.score(x_test , y_test)
y_predict = dt_model.predict(x_test)
print(dt_model.score(x_test , y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
ax=sns.heatmap(df_cm, annot=True ,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
count_misclassified = (y_test != y_predict).sum()
print('Misclassified samples: {}'.format(count_misclassified))
from IPython.display import Image
from sklearn import tree
from os import system
train_char_label = ['No', 'Yes']
pd_tree_regularized = open('pd_tree_regularized.dot','w')
dot_data = tree.export_graphviz(dt_model, out_file= pd_tree_regularized , feature_names = list(x_train), class_names = list(train_char_label))
pd_tree_regularized.close()
system("dot -Tpng pd_tree_regularized.dot -o pd_tree_regularized.png") # requires Graphviz installed; renders the .dot file to .png
print (pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = x_train.columns))
Image("pd_tree_regularized.png")
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50)
rfcl = rfcl.fit(x_train, y_train)
y_pred = rfcl.predict(x_test)
rfcl.score(x_test , y_test)
y_predict = rfcl.predict(x_test)
print(rfcl.score(x_test , y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
ax=sns.heatmap(df_cm, annot=True ,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples in Random Forest: {}'.format(count_misclassified))
feature_imp = pd.Series(rfcl.feature_importances_,index=x.columns).sort_values(ascending=False)
feature_imp
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
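The bar chart above uses the forest's impurity-based importances, which can be biased toward high-cardinality features. A sketch of the complementary permutation importance (sklearn 0.22+; `X_demo`/`y_demo` are synthetic stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

rf = RandomForestClassifier(n_estimators=50, random_state=7).fit(X_tr, y_tr)

# permutation importance = drop in test score when one feature is shuffled,
# measured on held-out data rather than on training impurity
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=7)
print(result.importances_mean.round(3))
```

On the Parkinson's data this would be called with `rfcl`, `x_test`, and `y_test`, and the resulting means plotted just like `feature_imp` above.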
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(base_estimator=dt_model, n_estimators=50, max_samples=.7)
bgcl = bgcl.fit(x_train, y_train)
y_pred = bgcl.predict(x_test)
bgcl.score(x_test , y_test)
y_predict = bgcl.predict(x_test)
print(bgcl.score(x_test , y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
ax=sns.heatmap(df_cm, annot=True ,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples in Bagging: {}'.format(count_misclassified))
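Bagging also offers a built-in validation signal: with `oob_score=True`, each tree is scored on the bootstrap samples it never saw. A sketch on synthetic stand-in data (the estimator is passed positionally so it works across sklearn versions, where the keyword changed from `base_estimator` to `estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=7)

# oob_score=True evaluates each sample on the trees whose bootstrap missed it,
# giving a validation estimate without a separate hold-out set
bgcl_demo = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                              n_estimators=50, max_samples=0.7,
                              oob_score=True, random_state=7)
bgcl_demo.fit(X_demo, y_demo)
print(round(bgcl_demo.oob_score_, 3))
```

For the notebook's `bgcl` above, adding `oob_score=True` would give a second accuracy estimate to compare against `bgcl.score(x_test, y_test)`.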
Bagging with two base estimators
clf1 = SVC()
clf2 = KNeighborsClassifier(n_neighbors=3)
bagging1 = BaggingClassifier(base_estimator=clf1, n_estimators=10, max_samples=0.8, max_features=0.8)
bagging2 = BaggingClassifier(base_estimator=clf2, n_estimators=10, max_samples=0.8, max_features=0.8)
label = ['SVC', 'K-NN', 'Bagging SVC', 'Bagging K-NN']
clf_list = [clf1, clf2, bagging1, bagging2]
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
for clf, label, grd in zip(clf_list, label, grid):
    scores = cross_val_score(clf, x, y, cv=3, scoring='accuracy')
    print("Accuracy: %.2f (+/- %.2f) [%s]" % (scores.mean(), scores.std(), label))
    pca = PCA(n_components=2)
    X_train2 = pca.fit_transform(x_test)
    clf.fit(X_train2, y_test)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(np.array(X_train2), np.array(y_test), clf=clf, legend=2)
    plt.title(label)
plt.show()
plot_learning_curves(x_train, y_train, x_test, y_test, bagging1, print_model=False, style='ggplot')
plt.show()
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier( n_estimators= 50)
abcl = abcl.fit(x_train,y_train)
y_pred = abcl.predict(x_test)
abcl.score(x_test , y_test)
y_predict = abcl.predict(x_test)
print(abcl.score(x_test , y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
ax=sns.heatmap(df_cm, annot=True ,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples in Ada Boosting: {}'.format(count_misclassified))
num_est = [1, 2, 3, 10]
label = ['AdaBoost (n_est=1)', 'AdaBoost (n_est=2)', 'AdaBoost (n_est=3)', 'AdaBoost (n_est=10)']
dt_model = DecisionTreeClassifier(criterion='entropy',max_depth=6,random_state=100,min_samples_leaf=5)
dt_model.fit(x_train, y_train)
fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
grid = itertools.product([0,1],repeat=2)
for n_est, label, grd in zip(num_est, label, grid):
    boosting = AdaBoostClassifier(base_estimator=dt_model, n_estimators=n_est)
    pca = PCA(n_components=2)
    X_train2 = pca.fit_transform(x_test)
    boosting.fit(X_train2, y_test)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(np.array(X_train2), np.array(y_test), clf=boosting, legend=2)
    plt.title(label)
plt.show()
boosting = AdaBoostClassifier(base_estimator=dt_model, n_estimators=10)
plt.figure()
plot_learning_curves(x_train, y_train, x_test, y_test, boosting, print_model=False, style='ggplot')
plt.show()
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.05)
gbcl = gbcl.fit(x_train,y_train)
y_pred = gbcl.predict(x_test)
gbcl.score(x_test , y_test)
y_predict = gbcl.predict(x_test)
print(gbcl.score(x_test , y_test))
cm=confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
ax=sns.heatmap(df_cm, annot=True ,fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples in Gradient Boosting: {}'.format(count_misclassified))
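Gradient boosting adds trees sequentially, so the number of estimators can itself be tuned by watching how test accuracy evolves stage by stage via `staged_predict`; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=7)
gb.fit(X_tr, y_tr)

# staged_predict yields predictions after each boosting round,
# showing how test accuracy evolves with the number of estimators
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
print(len(staged_acc), round(max(staged_acc), 3))
```

Plotting `staged_acc` makes it easy to see where accuracy plateaus and pick `n_estimators` accordingly; the same loop applies to `gbcl` with `x_test`/`y_test`.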
from sklearn import svm
# lets create an array of classifiers
MLA = [
#Ensemble Methods
ensemble.AdaBoostClassifier(),
ensemble.BaggingClassifier(),
ensemble.ExtraTreesClassifier(),
ensemble.GradientBoostingClassifier(),
ensemble.RandomForestClassifier(),
#Gaussian Processes
gaussian_process.GaussianProcessClassifier(),
#GLM
linear_model.LogisticRegressionCV(),
linear_model.PassiveAggressiveClassifier(),
linear_model.RidgeClassifierCV(),
linear_model.SGDClassifier(),
linear_model.Perceptron(),
#Naive Bayes
naive_bayes.BernoulliNB(),
naive_bayes.GaussianNB(),
#Nearest Neighbor
neighbors.KNeighborsClassifier(n_neighbors=3),
#SVM
svm.SVC(probability=True, kernel="rbf"),
svm.NuSVC(probability=True),
svm.LinearSVC(),
#Trees
tree.DecisionTreeClassifier(max_features =20),
tree.ExtraTreeClassifier(),
]
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)
row_index = 0
for alg in MLA:
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Train Accuracy'] = round(alg.score(x_train, y_train), 4)
    MLA_compare.loc[row_index, 'MLA Test Accuracy'] = round(alg.score(x_test, y_test), 4)
    MLA_compare.loc[row_index, 'MLA Precission'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'MLA Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'MLA AUC'] = auc(fp, tp)
    row_index += 1
MLA_compare.sort_values(by = ['MLA Test Accuracy'], ascending = False, inplace = True)
MLA_compare
Observation:
- As seen earlier, KNN performs best, with test accuracy 0.9661 (precision: 0.959184, recall: 1.000000, AUC: 0.916667)
- As shown above, the GradientBoostingClassifier ranks second, with test accuracy 0.9492 (precision: 0.958333, recall: 0.978723, AUC: 0.906028)
- the AdaBoostClassifier follows, with test accuracy 0.9322 and precision 0.977778
plt.figure(figsize= (20,35))
plt.subplot(4,2,1)
ax = sns.barplot(x="MLA Name", y="MLA Train Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.6)
plt.tight_layout()
plt.subplot(4,2,2)
ax = sns.barplot(x="MLA Name", y="MLA Test Accuracy",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Test Accuracy Comparison')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.6)
plt.tight_layout()
plt.subplot(4,2,3)
ax = sns.barplot(x="MLA Name", y="MLA Precission",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Precission Comparison')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.6)
plt.tight_layout()
plt.subplot(4,2,4)
ax = sns.barplot(x="MLA Name", y="MLA Recall",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.6)
plt.tight_layout()
plt.subplot(4,2,5)
ax = sns.barplot(x="MLA Name", y="MLA AUC",data=MLA_compare,palette='hot',edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.6)
plt.tight_layout()
index = 1
for alg in MLA:
    predicted = alg.fit(x_train, y_train).predict(x_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)' % (MLA_name, roc_auc_mla))
    index += 1
plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Conclusion:
A stacking ensemble was tried, consisting of Naive Bayes, SVC, and Decision Tree base classifiers whose predictions are combined by Logistic Regression as a meta-classifier. We can see the blending of decision boundaries achieved by the stacking classifier; the choice of base classifiers largely influences the stacking ensemble.
Based on the ensemble comparison, the highest-performing algorithms, k-Nearest Neighbors (KNN), AdaBoostClassifier, and GradientBoostingClassifier, are the best for the analysis of the Parkinson’s Disease dataset.